Enabling cooperative multi-gpu tests on multi-gpu nodes #27986
gshtras merged 2 commits into vllm-project:main
Code Review
This pull request enables cooperative multi-gpu tests by adding the container user to the render group for GPU device access and by removing the hardcoded HIP_VISIBLE_DEVICES=0 to allow tests to use multiple GPUs. The changes are correct in principle. However, I've pointed out a high-severity issue regarding the robustness of how the render group GID is obtained. My suggestions aim to make the script more robust and easier to debug by adding explicit checks and centralizing the logic, which also eliminates code duplication.
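The robustness concern about the GID lookup can be illustrated with a small sketch. It is not part of the PR: `show_args` is a hypothetical stand-in for `docker run` that only prints the arguments it receives, and `missing_gid` simulates a lookup that prints nothing, as `getent group render | cut -d: -f3` would on a host without a `render` group.

```shell
#!/bin/sh
# Hypothetical stand-in for `docker run`: prints each argument it receives,
# one per line, wrapped in brackets.
show_args() {
    for arg in "$@"; do
        printf '[%s]\n' "$arg"
    done
}

# Simulates a failed lookup: prints nothing, like
# `getent group render | cut -d: -f3` when the group does not exist.
missing_gid() {
    :
}

# Unquoted: the empty result disappears during word splitting, so
# --group-add is followed directly by the next option.
show_args --group-add $(missing_gid) --network=host

# Quoted: an explicit empty argument is passed instead.
show_args --group-add "$(missing_gid)" --network=host
```

In the unquoted form, which is the one used in the diff, `--group-add` ends up with no value of its own, so whatever token follows it is presumably read in its place. That would explain why the resulting error does not mention the missing group.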
@@ -186,6 +186,7 @@ if [[ $commands == *"--shard-id="* ]]; then
   --device /dev/kfd $BUILDKITE_AGENT_META_DATA_RENDER_DEVICES \
   --network=host \
   --shm-size=16gb \
+  --group-add $(getent group render | cut -d: -f3) \
The command substitution $(getent group render | cut -d: -f3) can produce an empty string if the render group does not exist on the host, causing the docker run command to fail with a confusing error. Since this logic is duplicated in the else block, it's best to perform the check once before the if/else block, store the GID in a variable, and reuse it.
This improves robustness, provides clearer error messages, and avoids code duplication. For example:
# Before the if-else block
render_gid=$(getent group render | cut -d: -f3)
if [[ -z "$render_gid" ]]; then
  echo "Error: 'render' group not found. This is required for GPU access." >&2
  exit 1
fi
# ... then inside both docker run commands:
--group-add "$render_gid" \

@@ -217,8 +218,8 @@ else
   --device /dev/kfd $BUILDKITE_AGENT_META_DATA_RENDER_DEVICES \
   --network=host \
   --shm-size=16gb \
+  --group-add $(getent group render | cut -d: -f3) \
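One way to centralize the lookup suggested in the review is a small helper function. The following is a sketch only, not the code that was merged: the names `gid_from_entry` and `require_group_gid` are hypothetical, and it is written in POSIX `sh`, so both variables it sets are global.

```shell
#!/bin/sh
# Extracts the GID (third colon-separated field) from one group entry in
# `getent group` format: name:password:gid:members
gid_from_entry() {
    printf '%s\n' "$1" | cut -d: -f3
}

# Hypothetical helper: prints the GID of the named group on stdout, or
# prints an error on stderr and returns 1 if the group cannot be resolved.
require_group_gid() {
    entry=$(getent group "$1") || entry=""
    gid=$(gid_from_entry "$entry")
    if [ -z "$gid" ]; then
        echo "Error: '$1' group not found. This is required for GPU access." >&2
        return 1
    fi
    printf '%s\n' "$gid"
}
```

A caller could then run `render_gid=$(require_group_gid render) || exit 1` once before the if/else block and pass `--group-add "$render_gid"` to both `docker run` commands. Keeping the field parsing in its own function makes it possible to check that step without depending on the host's group database.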
Force-pushed from 1257540 to b570a7c.
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com>
Force-pushed from b570a7c to 0f7744b.
…#27986) Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com>
Enabling cooperative multi-gpu tests on multi-gpu nodes.
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com>